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Some Handy Links 

• https://github.com/internetarchive 

• https://www.zotero.org/groups/internet_archive_-_open_apis_and_examples/items

• https://archive.org/details/oscon2015-ia-apis 



* The Internet Archive GitHub organization
* The Zotero group where you can find all of the links and references we'll be mentioning

Don't worry, we'll be showing these links again at the end of the presentation.



A Tour of Internet Archive



Internet Archive




501(c)(3) non-profit, founded in 1996 
Registered library in California, USA 
Partnerships with hundreds of libraries, museums & universities
24 petabytes of unique data
$12 million per year budget
Photo: Jason Scott



Archived web sites 

400+ billion web captures, 1996 to present 
Wayback Machine updated within hours 
700,000 people per day 
80TB crawl open for bulk download



How do we decide what to archive?

People pay us 

Organizations donate crawled content 
We crawl on our own behalf 

Deep crawl on popular sites 

Broad but shallow crawl on known domains 

Targeted crawls



Open Library

Open Library is yours to borrow, read & browse.

Books to Read
Books to Borrow

Borrow books at openlibrary.org



Video
2,000,000 items 

Feature films, documentaries, commercials, propaganda, stock footage, cartoons

Archive 60 channels of TV 24 hours per day



[Screenshot: TV News Archive showing an Al Jazeera America news broadcast alongside its searchable closed-caption text]

U.S. TV News Archive: archive.org/tv




IUMA: Internet Underground Music Archive 

We are working on building a large collection of commercial music 




Software: 10k CD-ROMs, 19k titles at Stanford; need clarification on DMCA



Why? 



We are good at storing and serving 
digital media and preserving it 
We care about the same things: 
knowledge, keeping information open, 
privacy 

We fight for what we care about 
We're not slick, but we are friendly! 



Internet Archive APIs 




Let's set up some expectations. Internet Archive provides many different ways to access and contribute to those 18+ Petabytes of data. There's a lot to cover, so I'll only be giving you a brief overview of each rather than a deep dive. Complete information can be found at the Zotero link.



Wayback Machine API 

http://archive.org/help/wayback_api.php



Wayback Machine API 

• Is a URL archived? 

• If so, is it available in the Wayback Machine? 



This API is a study in simplicity & ease: exactly what you need w/o clutter. 



Wayback Machine API 



http://archive.org/wayback/available?PARAMETERS



Simple URL-based API 

Only three possible parameters (url, timestamp, callback) and two are optional 
Returns JSON 



Wayback Machine API 



Parameters:

url=sub.domain.tld
timestamp=20030831060429 (YYYYMMDDhhmmss; optional)
callback=function name (optional)



Parameters for the Wayback API 

* url = what you're looking up (no protocol; http, etc.) 

* timestamp = YYYYMMDDhhmmss; YYYYMMDD OK; will return the closest snapshot to this timestamp

* callback = for jsonp



[Screenshot: NASA "Solar System Exploration" site, Planets page]

http://solarsystem.nasa.gov/planets/index.cfm



We've had a bit of excitement recently about New Horizons & its fly-by of Pluto, so let's use that as the basis for our examples today. 



Wayback Machine API 



curl 'http://archive.org/wayback/available?url=solarsystem.nasa.gov/planets/profile.cfm?Object=Dwarf&Display=Sats'









Let's see whether the Wayback has their site on record yet. Just throw this to curl and see what it returns. 
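If you'd rather not hand-assemble the URL for curl, here is a minimal Python sketch of the same request. The function names are made up for illustration, and unlike the curl command it percent-encodes the looked-up URL so its own "?" and "&" can't be mistaken for API parameters. fetch_availability needs network access and is not called here.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "http://archive.org/wayback/available"


def build_availability_url(url, timestamp=None, callback=None):
    """Build a Wayback availability query; only `url` is required."""
    params = {"url": url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    if callback is not None:
        params["callback"] = callback
    # urlencode percent-encodes the value of each parameter.
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_availability(url, timestamp=None):
    """Call the API and return the decoded JSON (needs network access)."""
    with urllib.request.urlopen(build_availability_url(url, timestamp)) as resp:
        return json.loads(resp.read().decode("utf-8"))


page = "solarsystem.nasa.gov/planets/profile.cfm?Object=Dwarf&Display=Sats"
print(build_availability_url(page, timestamp="20150711"))
```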



Wayback Machine API 



That's more like it. 



archived_snapshots: {
    closest: {
        available: true,
        url: "http://web.archive.org/web/20150711221849/https://solarsystem.nasa.gov/planets/profile.cfm?Object=Dwarf",
        timestamp: "20150711221849",
        status: "200"
    }
}
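Reading that response in code is just as simple. A small sketch, assuming the JSON has already been decoded into a Python dict; closest_snapshot is a hypothetical helper name, and the sample data is copied from the slides (the API returns an empty archived_snapshots object when it has nothing for the URL).

```python
def closest_snapshot(payload):
    """Return the closest snapshot dict, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest


found = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20150711221849/"
                   "https://solarsystem.nasa.gov/planets/profile.cfm?Object=Dwarf",
            "timestamp": "20150711221849",
            "status": "200",
        }
    }
}
missing = {"archived_snapshots": {}}

print(closest_snapshot(found)["timestamp"])
print(closest_snapshot(missing))
```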



Wayback Machine API 



{"archived_snapshots": {}}









This is what you see when there is no entry for the URL in the Wayback. This API is in use in some pretty interesting places. One example is: 



404 Handler 

[Screenshot: Internet Archive blog post of October 24, 2013 announcing a free 404 "File Not Found" handler that lets webmasters link their 404 pages to the Wayback Machine]

https://blog.archive.org/2013/10/24/web-archive-404-handler-for-webmasters/




The 404 Handler. Placed on your 404 page, if an archived version of the page is in the Wayback then it'll offer a link to it. 



Open Library API 



https://openlibrary.org/developers/api



Open web page for every book ever published. Think of it rather like Wikipedia for books. The library you can edit. 
A full RESTful API. 

Can return Open Library bibliographic & holdings records in JSON and RDF formats 



Open Library API 

• Query the Open Library database 

• View record information 

• Edit record information 

• View record history 



Features: most everything you can do on the Open Library website: Query the DB, look at information for books & authors, edit records and view edit history... 



Open Library API 



Let's see what Open Library has for the subject "pluto". I'll limit it to just one entry so it'll fit on the next slide. 
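The slides don't show the request itself, so here is a hedged sketch of what it might look like. It assumes Open Library's Subjects endpoint (/subjects/NAME.json with a limit parameter) and that subject keys are lower-case with underscores. The function names are illustrative, and fetch_subject needs network access and is not called here.

```python
import json
import urllib.parse
import urllib.request


def subject_url(subject, limit=1):
    """Build an Open Library subject query returning at most `limit` works."""
    # Assumption: subject keys are lower-case, with underscores for spaces.
    key = subject.strip().lower().replace(" ", "_")
    query = urllib.parse.urlencode({"limit": limit})
    return "https://openlibrary.org/subjects/{}.json?{}".format(
        urllib.parse.quote(key), query
    )


def fetch_subject(subject, limit=1):
    """Fetch and decode the subject record (needs network access)."""
    with urllib.request.urlopen(subject_url(subject, limit)) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(subject_url("pluto"))
```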



Open Library API



works: [
    {
        printdisabled: false,
        cover_id: null,
        ia_collection: [],
        has_fulltext: false,
        edition_count: 2,
        checked_out: false,
        title: "The planets",
        public_scan: false,
        lendinglibrary: false,
        lending_edition: "",
        overdrive: "",
        first_publish_year: null,
        key: "/works/OL2715443W",
        authors: [
            {
                name: "Heather Couper",
                key: "/authors/OL397296A"
            },
            {
                name: "Nigel Henbest",
                key: "/authors/OL449922A"
            }
        ],
        ia: null,
        lending_identifier: "",
        subject: [
            "Solar system",
            "Juvenile literature",
            "Planets"
        ]
    }
],
subject_type: "subject",
work_count: 13,
key: "/subjects/planets",
name: "planets"



So I see there's "The Planets" by Heather Couper & Nigel Henbest.
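Pulling titles and authors out of a record like that takes only a few lines. This is a sketch with a hypothetical helper name, run against a hand-copied subset of the fields shown on the slide rather than a live response.

```python
def summarize_works(subject_record):
    """Return (title, [author names]) pairs from a subject record."""
    summary = []
    for work in subject_record.get("works", []):
        authors = [author["name"] for author in work.get("authors", [])]
        summary.append((work["title"], authors))
    return summary


record = {
    "name": "planets",
    "work_count": 13,
    "works": [
        {
            "title": "The planets",
            "key": "/works/OL2715443W",
            "edition_count": 2,
            "authors": [
                {"name": "Heather Couper", "key": "/authors/OL397296A"},
                {"name": "Nigel Henbest", "key": "/authors/OL449922A"},
            ],
        }
    ],
}

for title, authors in summarize_works(record):
    print("{} by {}".format(title, " & ".join(authors)))
```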



Evergreen 


[Screenshot: Evergreen project home page]




http://evergreen-ils.org/ 



An example of this API in the wild: The Evergreen open source Integrated Library System uses the API to retrieve book covers and other information from OL.



Do-We-Want-It? API

http://want.archive.org/



So, how do all of those books get into Open Library? Well, a lot of them are donated by people like you. The Archive will accept your spare book, scan it & make it available online, and then save the book itself in its physical archive. But space is limited so they've provided a simple API to help you see whether they have a copy of that book yet.



Do-We-Want-It? API



http://want.archive.org/api?isbn=<isbn10 or isbn13>



This API works with anything which has an ISBN. It takes one argument and returns very easy to understand JSON. 
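As a rough sketch of that one-argument call in Python: want_url builds the request URL from the endpoint on the slide, and looks_like_isbn is a purely illustrative shape check (10 or 13 characters once hyphens are removed), not something the API itself provides.

```python
import urllib.parse

WANT_ENDPOINT = "http://want.archive.org/api"


def looks_like_isbn(isbn):
    """Cheap shape check: ISBN-13 is 13 digits, ISBN-10 may end in X."""
    compact = isbn.replace("-", "").replace(" ", "").upper()
    if len(compact) == 13:
        return compact.isdigit()
    if len(compact) == 10:
        return compact[:9].isdigit() and (compact[9].isdigit() or compact[9] == "X")
    return False


def want_url(isbn):
    """Build a Do-We-Want-It? query for an ISBN-10 or ISBN-13."""
    return WANT_ENDPOINT + "?" + urllib.parse.urlencode({"isbn": isbn.strip()})


print(looks_like_isbn("978-0393350395"), want_url("978-0393350395"))
```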



Do-We-Want-It? API



http://want.archive.org/api?isbn=978-0393350395



So let's see whether the Archive needs any copies of this book by Neil deGrasse Tyson & Donald Goldsmith.



Do-We-Want-It? API



status: "success",
result: "1",
description: "want_for_ia_pa",
identifier:



Yup! We can see here that the Archive would like to have a copy of this book. It doesn't yet have it available.



Do-We-Want-It? API



Response keys:

status
    fail - We failed to process the request. The submitted ISBN was invalid.
    success - The request was serviced successfully.

result
    -1 - We failed to process the request. The submitted ISBN was invalid.
     0 - The request was serviced successfully, but we have two copies, and do not want more.
     1 - The request was serviced successfully, and we have no copies. We want it.
     2 - The request was serviced successfully, and we have 1 copy already. We want it.

description - "human-parsable" description, with respect to the above:
    failure
    do_not_want
    want_for_ia_pa
    want_for_alt_pa

identifier - String when result = 2. This is the identifier already assigned on the cluster for that ISBN.
    -1 - No identifier designated for this ISBN.
    <string> - Designated identifier for this ISBN.



The results are pretty easy to read, but here you can see all of the possibilities for data returned. 
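Here's one way those result codes might be handled in code. It's a minimal sketch: the wording of the meanings is paraphrased from the slide, the helper names are invented, and the sample response is the one shown earlier (note that result arrives as a string).

```python
RESULT_MEANINGS = {
    -1: "request failed: the submitted ISBN was invalid",
    0: "two copies already held: not wanted",
    1: "no copies held: wanted",
    2: "one copy already held: wanted",
}


def is_wanted(response):
    """Return True when the Archive would like a copy of the book."""
    if response.get("status") != "success":
        return False
    # The slide shows result as a string ("1"), so convert before comparing.
    return int(response["result"]) in (1, 2)


def explain(response):
    """Translate the numeric result code into a short description."""
    return RESULT_MEANINGS.get(int(response.get("result", -1)), "unknown result code")


sample = {"status": "success", "result": "1", "description": "want_for_ia_pa"}
print(is_wanted(sample), "-", explain(sample))
```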



IA Search API 



https://archive.org/advancedsearch.php



As Alexis has pointed out, the Archive has ALL THIS GREAT STUFF. But how do you find it? This isn't a documented API so much as an easily extrapolatable URL format. 



IA Search API 



[Screenshot: the Advanced Search form]

Advanced Search

This form allows you to perform an advanced search. You only need to fill in one field below. This can be any field. If you select "not" as your match criteria, you must select one other field.

Fields: Any field, Title, Creator, Description, Collection, Custom field, Date



So let's continue my quest to learn more about planets, performing a search for all texts with a subject value of 'pluto.'



IA Search API 



[Screenshot: search results, including items from the NASA Technical Documents collection]

The search terms:
mediatype:(texts) AND subject:(pluto)

https://archive.org/search.php?query=mediatype%3A%28texts%29%20AND%20subject%3A%28pluto%29



It returns this URL. It has some URL percent-encoding going on, but otherwise makes it pretty obvious how to build a URL. OK, that's pretty cool, but...
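The encoding in that URL is ordinary URL percent-encoding (%3A is ":", %28 and %29 are the parentheses, %20 is a space), so building one is a single library call. A small sketch using only the Python standard library and the search terms from the slide:

```python
import urllib.parse

query = "mediatype:(texts) AND subject:(pluto)"

# quote() percent-encodes everything except letters, digits and "_.-~/".
encoded = urllib.parse.quote(query)
search_url = "https://archive.org/search.php?query=" + encoded

print(search_url)
# Decoding the query string recovers the original search terms.
print(urllib.parse.unquote(encoded))
```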



IA JSON API 



Advanced Search returning JSON, XML, and more

https://archive.org/advancedsearch.php?q=mediatype%3A%28texts%29+AND+subject%3A%28pluto%29+AND+collection%3A%28nasa%29&fl%5B%5D=identifier&sort%5B%5D=&sort%5B%5D=&sort%5B%5D=&rows=50&page=1&output=json&callback=callback



...the fact that I can ask for the exact same data to be returned as JSON (or CSV, or XML, or...). As you can see, this URL is a bit busier, but comparing it against the form it's quite easy to see what's going on.
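To make the pieces of that busier URL explicit, here is a sketch that assembles it from named parameters. The function name and defaults are illustrative, the parameter names (q, fl[], rows, page, output, callback) are the ones visible in the URL, and the empty sort[] entries are simply left out.

```python
import urllib.parse

ADVANCED_SEARCH = "https://archive.org/advancedsearch.php"


def advanced_search_url(query, fields=("identifier",), rows=50, page=1,
                        output="json", callback=None):
    """Build an advancedsearch.php URL from its individual parameters."""
    params = [("q", query)]
    # fl[] may be repeated, once per field to return.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", page), ("output", output)]
    if callback is not None:
        params.append(("callback", callback))
    return ADVANCED_SEARCH + "?" + urllib.parse.urlencode(params)


print(advanced_search_url(
    "mediatype:(texts) AND subject:(pluto) AND collection:(nasa)",
    callback="callback",
))
```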



IA JSON API 



callback(
{
    responseHeader: {
        status: 0,
        QTime: 108,
        params: {
            json.wrf: "callback",
            wt: "json",
            rows: "50",
            qin: "mediatype:(texts) AND subject:(pluto)",
            fl: "identifier",
            start: "0",
            q: "mediatype:(texts) AND subject:(pluto)"
        }
    },
    response: {
        numFound: 32,
        start: 0,
        docs: [
            {
                identifier: "nasa_techdoc_20050060913"
            },

        ]
    }
}
)



But it's even easier to see when you view the JSON output. Note the identifier. 
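Because a callback was requested, the JSON arrives wrapped in callback(...). A hedged sketch of unwrapping it and collecting the identifiers; the helper names are hypothetical, the sample string is a cut-down version of the response above, and a real response could differ in trailing characters.

```python
import json


def strip_jsonp(text, callback="callback"):
    """Remove a JSONP wrapper such as callback({...}) and decode the JSON."""
    text = text.strip()
    prefix = callback + "("
    if text.startswith(prefix) and text.endswith(")"):
        text = text[len(prefix):-1]
    return json.loads(text)


def identifiers(search_result):
    """List the identifier of every document in a search response."""
    return [doc["identifier"] for doc in search_result["response"]["docs"]]


raw = ('callback({"response": {"numFound": 32, "start": 0, '
       '"docs": [{"identifier": "nasa_techdoc_20050060913"}]}})')
print(identifiers(strip_jsonp(raw)))
```

Leaving callback off the request avoids the wrapper entirely, in which case json.loads alone is enough.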



IA JSON API 

• Lucene-based 

• Grouping 

• Fuzzy queries 

• Relevance boost 

• Date ranges 

• Etc. 



You can do some pretty sweet things here. But wait, there's more! 



IA Metadata API



http://blog.archive.org/2013/07/04/metadata-api/



The Internet Archive Metadata API allows you to DATA MINE that entire 18+ Petabytes of data. And it's WAY faster than it has any right to be.



IA Metadata API



http://archive.org/metadata/nasa_techdoc_20050060913/metadata



I want to learn more about that item I retrieved with the JSON API, so let's call the Metadata API on its identifier. To make the response slightly shorter, I'm going to limit it to just the most metadata-est part of the information, rather than ALL of it.
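The URL pattern is simply the identifier, optionally followed by the part of the record you want. A minimal sketch with illustrative function names; fetch_metadata needs network access and is not called here.

```python
import json
import urllib.parse
import urllib.request


def metadata_url(identifier, path=None):
    """Build a Metadata API URL, optionally limited to one part of the record."""
    url = "http://archive.org/metadata/" + urllib.parse.quote(identifier)
    if path:
        url += "/" + path.strip("/")
    return url


def fetch_metadata(identifier, path=None):
    """Fetch and decode the record (needs network access)."""
    with urllib.request.urlopen(metadata_url(identifier, path)) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(metadata_url("nasa_techdoc_20050060913", "metadata"))
```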



IA Metadata API 




result: {
    identifier: "nasa_techdoc_20050060913",
    date: "2004",
    description: "Terra MODIS 250 m observations are being applied to a Suspended Sediment Concentrat
    algorithm that is under development for coastal case 2 waters where reflectance is dominated by sedin
    entrained in major fluvial outflows. An atmospheric correction based on MODIS observations in the 50C
    resolution 1.6 and 2.1 micron bands is used to isolate the remote sensing reflectance in the MODIS 25
    resolution 650 and 865 nanometer bands. SSC estimates from remote sensing reflectance are based on ac
    inherent optical properties of sediment types known to be prevalent in the U.S. Gulf of Mexico coasta
    present our findings for the Atchafalaya Bay region of the Louisiana Coast, in the form of processed
    over the annual cycle. We also apply our algorithm to selected sites worldwide with a goal of extendi
    utility of our approach to the global direct broadcast community.",
    document-source: "CASI",
    documentid: "20050060913",
    nasa-center: "Goddard Space Flight Center",
    online-source: "http://wayback.archive-it.org/1792/20100127084754/http://hdl.handle.net/2060/2005
    original-nasa-rights: "Unclassified; No Copyright; Unlimited; Publicly available; Progress Report
    title: "Estimating Coastal Turbidity using MODIS 250 m Band Observations",
    updated-added-to-ntrs: "2008-06-02",
    year: "2004",
    collection: "nasa_techdocs",
    contributor: "NASA",
    language: "eng",
    licenseurl: "http://creativecommons.org/licenses/publicdomain/",
    mediatype: "texts",
    rights: "Public Domain",







And we get more JSON. There's a LOT of data here and I've truncated the output here. 



IA Metadata API



It can write metadata, too! 



If I had the authorizations to change this item's metadata. Which I don't. But if you do, then you can if you want. wOOt. 



But Wait! 
There's more! 



So right now you're probably thinking, "Sure, all this mining of 18+PB of data is neat and all, but how do I add to it?"



IAS3 API 



https://github.com/vmbrasseur/IAS3API



This is the big daddy: The Internet Archive S3-like API. And now you know why I'm up here speaking to you today: I'm the maintainer of the documentation for this API, which you can find at this GitHub URL.

Reminder: You can upload ANYTHING to IA. For free. As much as you want. They'll serve it up & preserve it forever. For free. 



IAS3 API 



Create items in Internet Archive, upload 
files to those items, maintain the metadata 

for the items, and download from any 
publicly-available Internet Archive items. 



Doesn't work for Open Library, but otherwise? Much of that stuff I just showed you? Can be handled with this one Big Daddy of an API. 



IAS3 API 



Drop-in replacement for the Amazon S3 API 

Pick your favorite S3 library/client, change the 
server to s3.us.archive.org, and you're good to go. 



This is a pretty involved API, so I'll only provide one brief and simple example. There are more in the documentation. 



IAS3 API 









curl --location --header 'x-amz-auto-make-bucket:1' \
     --header 'x-archive-meta01-collection:nasa_techdocs' \
     --header 'x-archive-meta-mediatype:movie' \
     --header 'x-archive-meta-title:Pluto Fly By' \
     --header "authorization: LOW $accesskey:$secret" \
     --upload-file new-horizon.mp4 \
     http://s3.us.archive.org/pluto-new-horizon/new-horizon.mp4



Create a new item (aka bucket) on Internet Archive with the identifier pluto-new-horizon, set its collection, mediatype and title metadata from the x-archive-meta headers, then upload the file new-horizon.mp4 to the item.
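For anyone who'd rather see that upload as code than as curl flags, here is a sketch that builds the equivalent PUT request with the Python standard library. The header names and endpoint are the ones from the curl command; the function name and credentials are placeholders, the collection header is left out to keep it short, and nothing is sent unless the request is passed to urllib.request.urlopen.

```python
import urllib.request

S3_ENDPOINT = "http://s3.us.archive.org"


def build_upload_request(identifier, filename, data, access_key, secret, metadata):
    """Build (but do not send) a PUT request that uploads `data` to an item."""
    headers = {
        # Create the item (bucket) if it does not exist yet.
        "x-amz-auto-make-bucket": "1",
        "authorization": "LOW {}:{}".format(access_key, secret),
    }
    # Each metadata field travels as its own x-archive-meta-<name> header.
    for key, value in metadata.items():
        headers["x-archive-meta-" + key] = value
    url = "{}/{}/{}".format(S3_ENDPOINT, identifier, filename)
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


request = build_upload_request(
    "pluto-new-horizon",
    "new-horizon.mp4",
    b"placeholder bytes, not a real video",
    access_key="ACCESSKEY",  # hypothetical credentials
    secret="SECRET",
    metadata={"mediatype": "movie", "title": "Pluto Fly By"},
)
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(name, "=", value)
```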
Can also download, change metadata, etc. A lot of people & organizations use this API, so I'll only highlight a very few of them. 



RECAP & Global Public Safety Codes



https://archive.org/details/usfederalcourts
https://archive.org/details/publicsafetycode



RECAP from Aaron Swartz & Global Public Safety Codes from Carl Malamud to free otherwise locked up public information.

But IMO, the most exciting use of IAS3API is by NASA.



Life is short. 
What if I don't want to 
learn S3? 



Sure, you think that's cool & all, but your time is valuable and you really don't want to spend it learning S3? OK, we can work with that. 



ia-wrapper 

https://github.com/jjjake/ia-wrapper 



We have Jake Johnson at the Archive to thank for this little wonder. This is a Python wrapper around IAS3 and the metadata and search APIs. It includes utilities for everything you want to do, without all the mess of wrangling S3 API headers. As if that weren't good enough...
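To give a feel for how much header-wrangling goes away, here is a rough sketch. It assumes a current release of the internetarchive Python package (the library behind ia-wrapper) with its upload and download functions, plus credentials already configured; the helper names are invented, and only the metadata helper runs without the library installed.

```python
def item_metadata(title, mediatype, collection=None):
    """Assemble the metadata dict handed to the library on upload."""
    metadata = {"title": title, "mediatype": mediatype}
    if collection is not None:
        metadata["collection"] = collection
    return metadata


def upload_file(identifier, path, metadata):
    """Upload one file to an item (needs the library, credentials, network)."""
    import internetarchive  # imported lazily so item_metadata works without it
    return internetarchive.upload(identifier, files=[path], metadata=metadata)


def download_item(identifier):
    """Download every file of an item (needs the library and network)."""
    import internetarchive
    return internetarchive.download(identifier)


print(item_metadata("Pluto Fly By", "movie"))
```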



iadownload



#!/usr/bin/env python

# iadownload: Download all files in a collection or item 

# Copyright 2014 VM Brasseur 

import os 
import sys 

import internetarchive 
import pprint 
import argparse 
import json 



https://github.com/vmbrasseur/iadownload 




It also includes a Python library, so you can build your own utilities and services. As you can see, I use this myself. Not only do I use it for downloading. 



iaupload 




https://github.com/vmbrasseur/iaupload
http://archive.org/details/sfperlmongers



I also use it to upload. I'm the organizer of the San Francisco Perl Mongers user group. We now record all of our events and upload the videos to the Archive for all to see. But we don't really have a ton of material. To really put ia-wrapper through its paces we need to look to...



Saving All The Things 






Source: Mirka23 on Flickr 



Jason Scott. Internet Archive employee, Founder of Archive Team, and activist computer archivist. As of the writing of this talk, Jason has uploaded just shy of 300K items to the Archive. Most of the items contain several files. Jason uses and swears by ia-wrapper to help him archive as much of computer history as inhumanly possible.




As you can imagine, there's a WHOLE lot more which could be said on this topic. But now you at least know where to start looking for more information. Let's recap: 



Recap 

• Wayback API

• Open Library APIs

• Do-We-Want-It? API

• Search/JSON APIs 

• Metadata API 

• IAS3 API 

• ia-wrapper 



And, don't forget, if you want to learn more about any of these things. 



Those Links Again... 

• https://github.com/internetarchive 

• https://www.zotero.org/groups/internet_archive_-_open_apis_and_examples/items

• https://archive.org/details/oscon2015-ia-apis 



As promised, here are those links again. Snap a picture or find the slides at the IA item (last link). We have one more important link to share with you. 



Donate to Internet Archive! 



http://archive.org/donate/



Your support helps us build amazing services and keep them free for people around the globe.



THANK YOU! 

Internet Archive is HIRING! (SF or Remote) 

• Senior Wayback Machine Engineer 

• Senior Dev Ops Engineer 

• Senior Cluster Storage & Computing Engineer 

• Senior Python Engineer 

• Web Application Developer

We need programmers to help us change (and preserve) the world!

VISIT US 
Free lunch Fridays at noon 
300 Funston Ave 
San Francisco 

alexis@archive.org 



